%pip install plotly
%pip install "jupyterlab>=3" "ipywidgets>=7.6"
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
To analyze colleges and their surrounding regions, we needed three things: a subset of colleges, a dataset with yearly characteristics of those colleges, and a dataset with yearly characteristics of their nearby geographical areas.
Initially, we decided to limit our analysis to the top 100 universities in the country according to current U.S. News rankings, under the assumption that more highly ranked universities might have a more significant impact on their respective communities. We used Andy Reiter's “U.S. News & World Report Historical Liberal Arts College and University Rankings” dataset (citation).
To obtain college characteristics, we turned to the Department of Education's College Scorecard, an extensive dataset on accredited universities with a public API for programmatically querying data.
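As a rough sketch of what a Scorecard query looks like, the snippet below assembles a request URL. The field names and `DEMO_KEY` placeholder are illustrative, not the exact variables our ETL modules request; the API is served through api.data.gov and requires a registered key.

```python
from urllib.parse import urlencode

# Base endpoint for the College Scorecard institutions API (via api.data.gov).
BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

def build_scorecard_url(api_key, fields, ipeds_ids):
    """Build a query URL for the given variables and IPEDS unit IDs."""
    params = {
        "api_key": api_key,
        # Comma-separated IPEDS IDs restrict results to our chosen colleges.
        "id": ",".join(str(i) for i in ipeds_ids),
        # Comma-separated variable names limit the returned columns.
        "fields": ",".join(fields),
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_scorecard_url(
    "DEMO_KEY",
    ["id", "school.name", "latest.student.size"],
    [131159, 100858],
)
```

In our repository, this request-building logic lives inside `scorecard_client.py` rather than the notebook itself.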
To obtain characteristics of the region around each university, we needed demographic and economic data for defined geographical regions associated with each university's location. The Census Bureau's yearly American Community Survey has the housing cost and income data we wanted to analyze, and its Public Use Microdata API lets us programmatically request that data for geographical groupings called "Public Use Microdata Areas" (PUMAs), the smallest geographical entities for which the Census collects yearly data.
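For a sense of the Census side, the sketch below builds an ACS 1-year request for PUMA-level data. The variable codes (`B19013_001E` for median household income, `B25064_001E` for median gross rent) and the geography syntax follow Census API conventions but are assumptions here, not the exact queries our ETL modules issue.

```python
from urllib.parse import urlencode

def build_acs_url(year, variables, state_fips, api_key):
    """Build an ACS 1-year query URL for all PUMAs in one state."""
    base = f"https://api.census.gov/data/{year}/acs/acs1"
    params = {
        # NAME gives the human-readable geography label alongside the stats.
        "get": ",".join(["NAME"] + variables),
        "for": "public use microdata area:*",
        "in": f"state:{state_fips:02d}",
        "key": api_key,
    }
    return f"{base}?{urlencode(params)}"

url = build_acs_url(2019, ["B19013_001E", "B25064_001E"], 6, "MY_KEY")
```

The response comes back as a JSON array of rows, which maps naturally onto a DataFrame.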
Since we queried a substantial amount of data from very large datasets, and requesting federal data from the Department of Education and the Census requires registering for and using API keys, we decided that, in addition to the source datasets we were able to download in full (stored in our repository under ETL/source_data), we would create modules that make the federal API requests and save the results to CSV files for later use.
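The pattern those modules follow can be sketched as a fetch-once helper: hit the API only if the generated CSV is not already on disk. `fetch_fn` here is a hypothetical callable standing in for an API client method, not a function from our repository.

```python
import os
import pandas as pd

def load_or_fetch(csv_path, fetch_fn):
    """Return the cached CSV if present; otherwise fetch, cache, and return."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = fetch_fn()
    # Cache the result so reruns of the notebook skip the API entirely.
    df.to_csv(csv_path, index=False)
    return df
```

This keeps the notebook reproducible without requiring every reader to hold API keys.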
Dataframes generated from queried data are stored under ETL/generated_data as CSV and loaded into the notebook when needed. Specifically, we built ScorecardData.csv using our scorecard_client.py module, which defines a CollegeScorecardClient object that queries DoE data given a valid API key, a set of desired variables, and a set of colleges identified by IPEDS IDs. We built college_FIPs by combining the university list from Reiter with state FIPS data from the DoE and county FIPS codes collected manually, university by university.
For the rest of this tutorial, we will be using the data we collected by default, but if you would like to recreate the analysis of this tutorial using a different set of colleges, and thus your own datasets, you can fork this repository and use the modules provided in the ETL/ directory to do so.
Here, we load the data we downloaded or generated locally into the notebook for use in our analysis. We stored each dataset as CSV, so they are easily loaded into pandas DataFrames. We will also define a few terms along the way so that the reader can follow what we are working with:
# Read generated Scorecard and FIPS data, plus source CPI data (averaged per year)
scard_df = pd.read_csv("ETL/generated_data/ScorecardData.csv")
fips_df = pd.read_csv("ETL/generated_data/college_FIPs.csv")
cpi_df = pd.read_csv("ETL/source_data/cpi_all.csv").groupby("Year")["Value"].mean()
scard_df.head()
| | student.size | cost.tuition.in_state | cost.tuition.out_of_state | cost.avg_net_price.public | cost.avg_net_price.private | id | school.name | school.carnegie_size_setting | school.zip | school.region_id | school.state_fips | school.locale | school.ownership | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5042.0 | NaN | NaN | NaN | NaN | 131159 | American University | 17 | 20016-8001 | 2 | 11 | 11 | 2 | 1996 |
| 1 | 18337.0 | NaN | NaN | NaN | NaN | 100858 | Auburn University | 15 | 36849 | 5 | 1 | 13 | 1 | 1996 |
| 2 | 10395.0 | NaN | NaN | NaN | NaN | 223232 | Baylor University | 16 | 76798 | 6 | 48 | 12 | 2 | 1996 |
| 3 | 9203.0 | NaN | NaN | NaN | NaN | 196079 | Binghamton University | 16 | 13850-6000 | 2 | 36 | 22 | 1 | 1996 |
| 4 | 10120.0 | NaN | NaN | NaN | NaN | 164924 | Boston College | 17 | 02467 | 1 | 25 | 13 | 2 | 1996 |
years = range(2009, 2020)
msa_path_format = "ETL/generated_data/MSA{y}.csv"
# Load each year's MSA file and tag its rows with that year
MSA_frames = [
    pd.read_csv(msa_path_format.format(y=yr)).assign(year=yr)
    for yr in years
]
msa_df = pd.concat(MSA_frames)
msa_df.columns = ["name", "msa_income", "msa_rent", "msa", "year"]
print(msa_df["name"].unique().shape)
msa_df
(36,)
| | name | msa_income | msa_rent | msa | year |
|---|---|---|---|---|---|
| 0 | CO Metro Area" | 59007 | 1207 | 19740] | 2009 |
| 1 | NC Metro Area" | 49902 | 921 | 20500] | 2009 |
| 2 | FL Metro Area" | 37654 | 881 | 23540] | 2009 |
| 3 | NC Metro Area" | 41272 | 789 | 24660] | 2009 |
| 4 | SC Metro Area" | 43283 | 763 | 24860] | 2009 |
| ... | ... | ... | ... | ... | ... |
| 56 | CA Metro Area" | 77774 | 1774 | 31080] | 2019 |
| 57 | WI Metro Area" | 75545 | 1218 | 31540] | 2019 |
| 58 | DC-VA-MD-WV Metro Area" | 105659 | 1862 | 47900] | 2019 |
| 59 | NC Metro Area" | 52322 | 815 | 49180] | 2019 |
| 60 | MA-CT Metro Area" | 76348 | 1319 | 49340]] | 2019 |
657 rows × 5 columns
years = range(2009, 2020)
puma_path_format = "ETL/generated_data/PUMA{y}.csv"
# Load each year's PUMA file and tag its rows with that year
PUMA_frames = [
    pd.read_csv(puma_path_format.format(y=yr)).assign(year=yr)
    for yr in years
]
puma_df = pd.concat(PUMA_frames).reset_index(drop=True)
puma_df.columns = ["puma_income", "puma_rent", "state", "puma", "year"]
puma_df["state"] = puma_df["state"].astype("int")
puma_df["puma"] = puma_df["puma"].astype("int")
puma_df["year"] = puma_df["year"].astype("int")
print(puma_df["puma"].unique().shape)
puma_df
(84,)
| | puma_income | puma_rent | state | puma | year |
|---|---|---|---|---|---|
| 0 | 52023 | 1060 | 6 | 1800 | 2009 |
| 1 | 36740 | 1020 | 6 | 1901 | 2009 |
| 2 | 69677 | 1710 | 6 | 2001 | 2009 |
| 3 | 49350 | 1244 | 6 | 2101 | 2009 |
| 4 | 70522 | 1561 | 6 | 2201 | 2009 |
| ... | ... | ... | ... | ... | ... |
| 3414 | 101524 | 1921 | 51 | 1302 | 2019 |
| 3415 | 51987 | 751 | 55 | 100 | 2019 |
| 3416 | 57284 | 1254 | 55 | 101 | 2019 |
| 3417 | 61443 | 767 | 55 | 1301 | 2019 |
| 3418 | 65851 | 826 | 55 | 1600 | 2019 |
3419 rows × 5 columns
The data still uses the variable names and formatting of our original sources; those names are unwieldy and not ideal for use in later analysis, so we rename our columns to be more human-readable and developer-friendly. Additionally, cost data in our sources does not account for inflation, so we use an all-consumers/all-goods CPI to convert dollar values to a standard base.
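To make the rebasing concrete, here is a toy example of the arithmetic we apply below: dividing a nominal dollar amount by that year's CPI level and multiplying by 100 expresses it in base-period (CPI = 100) dollars, so values are comparable across years. The CPI levels here are illustrative, not drawn from our source series.

```python
# Illustrative CPI levels (not our source data)
cpi = {2009: 214.5, 2019: 255.7}
nominal = 30000  # same sticker price in both years

# Deflate each nominal value to base-period dollars
real_2009 = nominal / cpi[2009] * 100
real_2019 = nominal / cpi[2019] * 100

# The same sticker price is worth less in real terms in 2019 than in 2009.
```

This is exactly the `value / cpi * 100` transformation the next cells apply to net cost and tuition.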
# Rename columns to be more readable, usable
scard_df = scard_df.rename(
columns={
"student.size": "size",
"cost.tuition.in_state": "in_state_tuition",
"cost.tuition.out_of_state": "out_state_tuition",
"cost.avg_net_price.public": "public_net_price",
"cost.avg_net_price.private": "private_net_price",
"school.name": "name",
"school.carnegie_size_setting": "size_setting",
"school.zip": "zip",
"school.state_fips": "state_fips",
"school.region_id": "region_id",
"school.locale": "locale",
"school.ownership": "ownership"
}
)
# Join county FIPs codes into College Scorecard dataframe for use later in
# associating with Census geographies.
scard_df = pd.merge(
scard_df,
fips_df[["id", "county", "cbsa", "puma"]],
on="id", how="left").drop_duplicates()
print(scard_df)
# Combine public and private net prices into a single net price column
scard_df["net_cost"] = scard_df["public_net_price"].where(
    scard_df["ownership"] == 1, scard_df["private_net_price"]
)
# Deflate dollar figures by each row's yearly CPI (base = 100)
cpi_by_row = scard_df["year"].map(cpi_df)
scard_df["net_cost_adjusted"] = scard_df["net_cost"] / cpi_by_row * 100
scard_df["in_tuition_adjusted"] = scard_df["in_state_tuition"] / cpi_by_row * 100
scard_df["out_tuition_adjusted"] = scard_df["out_state_tuition"] / cpi_by_row * 100
scard_df["non_tuition_expenses_adj"] = (
    scard_df["net_cost_adjusted"] - scard_df["in_tuition_adjusted"]
)
scard_df["state_fips"] = scard_df["state_fips"].astype(int)
scard_df["year"] = scard_df["year"].astype(int)
scard_df.drop(["public_net_price", "private_net_price"], axis=1, inplace=True)
scard_df.head()
size in_state_tuition out_state_tuition public_net_price \
0 5042.0 NaN NaN NaN
1 18337.0 NaN NaN NaN
2 18337.0 NaN NaN NaN
3 10395.0 NaN NaN NaN
4 9203.0 NaN NaN NaN
... ... ... ... ...
3820 7366.0 57386.0 57386.0 NaN
3821 6217.0 23628.0 46854.0 19593.0
3822 4804.0 54416.0 54416.0 NaN
3823 4701.0 57700.0 57700.0 NaN
3824 2619.0 46475.0 46475.0 NaN
private_net_price id name \
0 NaN 131159 American University
1 NaN 100858 Auburn University
2 NaN 100858 Auburn University
3 NaN 223232 Baylor University
4 NaN 196079 Binghamton University
... ... ... ...
3820 26921.0 179867 Washington University in St Louis
3821 NaN 231624 William & Mary
3822 42835.0 168421 Worcester Polytechnic Institute
3823 15296.0 130794 Yale University
3824 39536.0 197708 Yeshiva University
size_setting zip region_id state_fips locale ownership \
0 17 20016-8001 2 11 11 2
1 15 36849 5 1 13 1
2 15 36849 5 1 13 1
3 16 76798 6 48 12 2
4 16 13850-6000 2 36 22 1
... ... ... ... ... ... ...
3820 17 63130-4899 4 29 21 2
3821 14 23187-8795 5 51 23 1
3822 13 01609-2280 1 25 12 2
3823 17 06520 1 9 12 2
3824 14 10033-3299 2 36 11 2
year county cbsa puma
0 1996 1.0 47900.0 101.0
1 1996 81.0 12220.0 2101.0
2 1996 81.0 10760.0 2101.0
3 1996 309.0 47380.0 3801.0
4 1996 7.0 13780.0 2201.0
... ... ... ... ...
3820 2020 510.0 41180.0 2001.0
3821 2020 830.0 47260.0 9500.0
3822 2020 27.0 49340.0 505.0
3823 2020 9.0 35300.0 20601.0
3824 2020 61.0 35620.0 4112.0
[3825 rows x 17 columns]
| | size | in_state_tuition | out_state_tuition | id | name | size_setting | zip | region_id | state_fips | locale | ownership | year | county | cbsa | puma | net_cost | net_cost_adjusted | in_tuition_adjusted | out_tuition_adjusted | non_tuition_expenses_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5042.0 | NaN | NaN | 131159 | American University | 17 | 20016-8001 | 2 | 11 | 11 | 2 | 1996 | 1.0 | 47900.0 | 101.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 18337.0 | NaN | NaN | 100858 | Auburn University | 15 | 36849 | 5 | 1 | 13 | 1 | 1996 | 81.0 | 12220.0 | 2101.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 18337.0 | NaN | NaN | 100858 | Auburn University | 15 | 36849 | 5 | 1 | 13 | 1 | 1996 | 81.0 | 10760.0 | 2101.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 10395.0 | NaN | NaN | 223232 | Baylor University | 16 | 76798 | 6 | 48 | 12 | 2 | 1996 | 309.0 | 47380.0 | 3801.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 9203.0 | NaN | NaN | 196079 | Binghamton University | 16 | 13850-6000 | 2 | 36 | 22 | 1 | 1996 | 7.0 | 13780.0 | 2201.0 | NaN | NaN | NaN | NaN | NaN |
locale_map = {
11: "City: Large",
12: "City: Midsize",
13: "City: Small",
21: "Suburb: Large",
22: "Suburb: Midsize",
23: "Suburb: Small",
31: "Town: Fringe",
32: "Town: Distant",
33: "Town: Remote",
41: "Rural: Fringe",
42: "Rural: Distant",
43: "Rural: Remote"
}
ownership_map = {
1: "Public",
2: "Private nonprofit",
3: "Private for-profit"
}
region_map = {
0: "U.S. Service Schools",
1: "New England",
2: "Mid East",
3: "Great Lakes",
4: "Plains",
5: "Southeast",
6: "Southwest",
7: "Rocky Mountains",
8: "Far West",
9: "Outlying Areas"
}
# Replace coded values with their human-readable labels
scard_df["region"] = scard_df["region_id"].map(region_map)
scard_df["ownership"] = scard_df["ownership"].map(ownership_map)
scard_df["locale"] = scard_df["locale"].map(locale_map)
Note that some rows lack cost data. Since we will use cost data later in our analysis, we need to either interpolate the missing values or drop the affected rows. Here, we experiment with dropping rows with missing data.
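As a toy comparison of the two options on hypothetical cost data (not our actual dataset): dropping loses the incomplete rows entirely, while linear interpolation fills gaps between known values.

```python
import pandas as pd

# Hypothetical three-year cost series with one missing value
toy = pd.DataFrame({
    "year": [2009, 2010, 2011],
    "net_cost": [10000.0, None, 12000.0],
})

# Option 1: drop rows with missing cost data
dropped = toy.dropna(subset=["net_cost"])

# Option 2: linearly interpolate the gap between neighboring years
filled = toy.assign(net_cost=toy["net_cost"].interpolate())
```

We proceed with dropping, since our missingness turns out to be concentrated in whole early years rather than scattered gaps.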
scard_clipped = scard_df.dropna(subset=["net_cost", "in_state_tuition", "out_state_tuition"]).copy()
#print(scard_clipped["name"].unique().shape)
#print(scard_clipped.to_string())
It appears that the clipped dataframe, after dropping null cost data, is simply the data from 2009 onward. To verify this, we query the original dataset by restricting the years alone. If cost data is complete from 2009 to 2020, the resulting dataframe should equal the one produced by dropping null data. Run the next code cell to confirm.
scard_clipped_year = scard_df[scard_df["year"] >= 2009].copy()
scard_clipped_year.equals(scard_clipped)
True
# Strip stray brackets left over from parsing the raw API response;
# regex=False treats "]" as a literal string and avoids a FutureWarning
msa_df["msa"] = msa_df["msa"].str.replace("]", "", regex=False).astype(int)
msa_df
| | name | msa_income | msa_rent | msa | year |
|---|---|---|---|---|---|
| 0 | CO Metro Area" | 59007 | 1207 | 19740 | 2009 |
| 1 | NC Metro Area" | 49902 | 921 | 20500 | 2009 |
| 2 | FL Metro Area" | 37654 | 881 | 23540 | 2009 |
| 3 | NC Metro Area" | 41272 | 789 | 24660 | 2009 |
| 4 | SC Metro Area" | 43283 | 763 | 24860 | 2009 |
| ... | ... | ... | ... | ... | ... |
| 56 | CA Metro Area" | 77774 | 1774 | 31080 | 2019 |
| 57 | WI Metro Area" | 75545 | 1218 | 31540 | 2019 |
| 58 | DC-VA-MD-WV Metro Area" | 105659 | 1862 | 47900 | 2019 |
| 59 | NC Metro Area" | 52322 | 815 | 49180 | 2019 |
| 60 | MA-CT Metro Area" | 76348 | 1319 | 49340 | 2019 |
657 rows × 5 columns
scard_clipped_geo = scard_clipped.dropna(subset=["cbsa", "puma"]).copy()
scard_clipped_geo
| | size | in_state_tuition | out_state_tuition | id | name | size_setting | zip | region_id | state_fips | locale | ... | year | county | cbsa | puma | net_cost | net_cost_adjusted | in_tuition_adjusted | out_tuition_adjusted | non_tuition_expenses_adj | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1989 | 6430.0 | 34973.0 | 34973.0 | 131159 | American University | 17 | 20016-8001 | 2 | 11 | City: Large | ... | 2009 | 1.0 | 47900.0 | 101.0 | 40345.0 | 18803.189093 | 16299.514987 | 16299.514987 | 2503.674106 | Mid East |
| 1990 | 19918.0 | 6972.0 | 19452.0 | 100858 | Auburn University | 15 | 36849 | 5 | 1 | City: Small | ... | 2009 | 81.0 | 12220.0 | 2101.0 | 13225.0 | 6163.642973 | 3249.370042 | 9065.798345 | 2914.272931 | Southeast |
| 1991 | 19918.0 | 6972.0 | 19452.0 | 100858 | Auburn University | 15 | 36849 | 5 | 1 | City: Small | ... | 2009 | 81.0 | 10760.0 | 2101.0 | 13225.0 | 6163.642973 | 3249.370042 | 9065.798345 | 2914.272931 | Southeast |
| 1992 | 12101.0 | 28070.0 | 28070.0 | 223232 | Baylor University | 16 | 76798 | 6 | 48 | City: Midsize | ... | 2009 | 309.0 | 47380.0 | 3801.0 | 24174.0 | 11266.533477 | 13082.303082 | 13082.303082 | -1815.769605 | Southwest |
| 1993 | 11635.0 | 6761.0 | 14661.0 | 196079 | Binghamton University | 16 | 13850-6000 | 2 | 36 | Suburb: Midsize | ... | 2009 | 7.0 | 13780.0 | 2201.0 | 13999.0 | 6524.373382 | 3151.031391 | 6832.905076 | 3373.341992 | Mid East |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3820 | 7366.0 | 57386.0 | 57386.0 | 179867 | Washington University in St Louis | 17 | 63130-4899 | 4 | 29 | Suburb: Large | ... | 2020 | 510.0 | 41180.0 | 2001.0 | 26921.0 | 10400.208357 | 22169.546331 | 22169.546331 | -11769.337974 | Plains |
| 3821 | 6217.0 | 23628.0 | 46854.0 | 231624 | William & Mary | 14 | 23187-8795 | 5 | 51 | Suburb: Small | ... | 2020 | 830.0 | 47260.0 | 9500.0 | 19593.0 | 7569.231542 | 9128.045877 | 18100.789806 | -1558.814335 | Southeast |
| 3822 | 4804.0 | 54416.0 | 54416.0 | 168421 | Worcester Polytechnic Institute | 13 | 01609-2280 | 1 | 25 | City: Midsize | ... | 2020 | 27.0 | 49340.0 | 505.0 | 42835.0 | 16548.156642 | 21022.166263 | 21022.166263 | -4474.009620 | New England |
| 3823 | 4701.0 | 57700.0 | 57700.0 | 130794 | Yale University | 17 | 06520 | 1 | 9 | City: Midsize | ... | 2020 | 9.0 | 35300.0 | 20601.0 | 15296.0 | 5909.200514 | 22290.851833 | 22290.851833 | -16381.651319 | New England |
| 3824 | 2619.0 | 46475.0 | 46475.0 | 197708 | Yeshiva University | 14 | 10033-3299 | 2 | 36 | City: Large | ... | 2020 | 61.0 | 35620.0 | 4112.0 | 39536.0 | 15273.676223 | 17954.373292 | 17954.373292 | -2680.697069 | Mid East |
1656 rows × 21 columns
scard_clipped_geo["puma"] = scard_clipped_geo["puma"].astype(int)
scard_clipped_geo["cbsa"] = scard_clipped_geo["cbsa"].astype(int)
print(scard_clipped_geo["puma"].unique().shape)
scard_clipped_geo
(111,)
| | size | in_state_tuition | out_state_tuition | id | name | size_setting | zip | region_id | state_fips | locale | ... | year | county | cbsa | puma | net_cost | net_cost_adjusted | in_tuition_adjusted | out_tuition_adjusted | non_tuition_expenses_adj | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1989 | 6430.0 | 34973.0 | 34973.0 | 131159 | American University | 17 | 20016-8001 | 2 | 11 | City: Large | ... | 2009 | 1.0 | 47900 | 101 | 40345.0 | 18803.189093 | 16299.514987 | 16299.514987 | 2503.674106 | Mid East |
| 1990 | 19918.0 | 6972.0 | 19452.0 | 100858 | Auburn University | 15 | 36849 | 5 | 1 | City: Small | ... | 2009 | 81.0 | 12220 | 2101 | 13225.0 | 6163.642973 | 3249.370042 | 9065.798345 | 2914.272931 | Southeast |
| 1991 | 19918.0 | 6972.0 | 19452.0 | 100858 | Auburn University | 15 | 36849 | 5 | 1 | City: Small | ... | 2009 | 81.0 | 10760 | 2101 | 13225.0 | 6163.642973 | 3249.370042 | 9065.798345 | 2914.272931 | Southeast |
| 1992 | 12101.0 | 28070.0 | 28070.0 | 223232 | Baylor University | 16 | 76798 | 6 | 48 | City: Midsize | ... | 2009 | 309.0 | 47380 | 3801 | 24174.0 | 11266.533477 | 13082.303082 | 13082.303082 | -1815.769605 | Southwest |
| 1993 | 11635.0 | 6761.0 | 14661.0 | 196079 | Binghamton University | 16 | 13850-6000 | 2 | 36 | Suburb: Midsize | ... | 2009 | 7.0 | 13780 | 2201 | 13999.0 | 6524.373382 | 3151.031391 | 6832.905076 | 3373.341992 | Mid East |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3820 | 7366.0 | 57386.0 | 57386.0 | 179867 | Washington University in St Louis | 17 | 63130-4899 | 4 | 29 | Suburb: Large | ... | 2020 | 510.0 | 41180 | 2001 | 26921.0 | 10400.208357 | 22169.546331 | 22169.546331 | -11769.337974 | Plains |
| 3821 | 6217.0 | 23628.0 | 46854.0 | 231624 | William & Mary | 14 | 23187-8795 | 5 | 51 | Suburb: Small | ... | 2020 | 830.0 | 47260 | 9500 | 19593.0 | 7569.231542 | 9128.045877 | 18100.789806 | -1558.814335 | Southeast |
| 3822 | 4804.0 | 54416.0 | 54416.0 | 168421 | Worcester Polytechnic Institute | 13 | 01609-2280 | 1 | 25 | City: Midsize | ... | 2020 | 27.0 | 49340 | 505 | 42835.0 | 16548.156642 | 21022.166263 | 21022.166263 | -4474.009620 | New England |
| 3823 | 4701.0 | 57700.0 | 57700.0 | 130794 | Yale University | 17 | 06520 | 1 | 9 | City: Midsize | ... | 2020 | 9.0 | 35300 | 20601 | 15296.0 | 5909.200514 | 22290.851833 | 22290.851833 | -16381.651319 | New England |
| 3824 | 2619.0 | 46475.0 | 46475.0 | 197708 | Yeshiva University | 14 | 10033-3299 | 2 | 36 | City: Large | ... | 2020 | 61.0 | 35620 | 4112 | 39536.0 | 15273.676223 | 17954.373292 | 17954.373292 | -2680.697069 | Mid East |
1656 rows × 21 columns
scard_df = scard_clipped_geo
scard_df = pd.merge(scard_df, puma_df, left_on=["puma", "year"], right_on=["puma", "year"], how="inner")
scard_df["name"].unique().shape
(70,)
scard_df = pd.merge(scard_df, msa_df, left_on=["cbsa", "year"], right_on=["msa", "year"], how="left")
scard_df["name_y"].unique().shape
(33,)
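The shrinking name counts above show that each merge discards institutions without matching census rows. One way to audit what a merge keeps is `indicator=True`; a minimal sketch with hypothetical toy frames (`scard`, `puma`, and their values are stand-ins for our real DataFrames):

```python
import pandas as pd

# Toy stand-ins for scard_df and puma_df (hypothetical values)
scard = pd.DataFrame({"puma": [101, 2101, 9999], "year": [2009, 2009, 2009],
                      "name": ["A", "B", "C"]})
puma = pd.DataFrame({"puma": [101, 2101], "year": [2009, 2009],
                     "puma_rent": [900, 700]})

# indicator=True adds a _merge column saying which side each row came from
merged = pd.merge(scard, puma, on=["puma", "year"], how="left", indicator=True)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are exactly the ones an inner merge would have dropped.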
In this section, we will analyze different components of our dataset graphically to display the relationships within it. For example, the next block shows a violin plot of the net cost of every university and how it changes each year. Since the purpose of a violin plot is to visualize a distribution of data, we can see the increasing volatility in the net cost of university as time goes on.
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'],
        y=scard_df['net_cost'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()
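The increasing volatility the violins suggest can also be checked numerically by comparing within-year spread; a minimal sketch with hypothetical net-cost values for two years (the toy frame is a stand-in for scard_df):

```python
import pandas as pd

# Hypothetical net costs: the 2020 values are far more spread out than the 2009 ones
toy = pd.DataFrame({
    "year": [2009, 2009, 2009, 2020, 2020, 2020],
    "net_cost": [12000.0, 14000.0, 13000.0, 9000.0, 25000.0, 41000.0],
})
# Per-year standard deviation quantifies the widening distributions
spread = toy.groupby("year")["net_cost"].std()
print(spread)
```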
We can observe that net cost appears to be bimodally distributed, implying multiple factors with significant variance. Let's keep exploring the data and see whether this holds for other costs.
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'],
        y=scard_df['in_state_tuition'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()
fig = go.Figure()
fig.add_trace(
    go.Violin(
        x=scard_df['year'],
        y=scard_df['out_state_tuition'],
        box_visible=True,
        meanline_visible=True
    )
)
fig.show()
It seems this bimodal distribution exists for all of our tuition cost data. Let's look at whether public and private institutions behave differently.
fig = px.scatter(scard_df, x="year", y="in_state_tuition", facet_col="ownership")
fig.show()
fig = px.scatter(scard_df, x="year", y="in_tuition_adjusted", facet_col="ownership")
fig.show()
fig = px.scatter(scard_df, x="year", y="out_tuition_adjusted", facet_col="ownership")
fig.show()
Aha! There does appear to be a relationship between ownership (public vs. private) and net costs, as the two visibly vary differently. We can explore this in our analysis later. Next, let's look at our census data. MSA groupings are major geographical units for the census, and are useful to compare against smaller units.
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(
        x=msa_df["year"],
        y=msa_df["msa_income"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(
        x=msa_df["year"],
        y=msa_df["msa_rent"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=2
)
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for MSA")
fig.show()
It seems rent and income across our MSAs are mostly normally distributed within each year, and vary with time. Let's do the same with PUMAs, smaller census geographical units that are more representative of the immediate area around our colleges.
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(
        x=puma_df["year"],
        y=puma_df["puma_income"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(
        x=puma_df["year"],
        y=puma_df["puma_rent"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=2
)
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for PUMA")
fig.show()
The same holds true for PUMAs! However, it's possible that inflation accounts for our yearly variance; let's normalize using an all-items CPI and try again.
msa_df["msa_income_adj"] = msa_df.apply(
    lambda row: (row["msa_income"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
msa_df["msa_rent_adj"] = msa_df.apply(
    lambda row: (row["msa_rent"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
scard_df["msa_income_adj"] = scard_df.apply(
    lambda row: (row["msa_income"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
scard_df["msa_rent_adj"] = scard_df.apply(
    lambda row: (row["msa_rent"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
# Annualized: adjusted monthly rent times 12, for comparison against yearly costs
scard_df["msa_rent_adj_ytd"] = scard_df.apply(
    lambda row: (row["msa_rent"] / cpi_df.at[row["year"]]) * 100 * 12,
    axis=1
)
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(
        x=msa_df["year"],
        y=msa_df["msa_income_adj"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(
        x=msa_df["year"],
        y=msa_df["msa_rent_adj"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=2
)
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for MSA Adjusted for Inflation")
fig.show()
puma_df["puma_income_adj"] = puma_df.apply(
    lambda row: (row["puma_income"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
puma_df["puma_rent_adj"] = puma_df.apply(
    lambda row: (row["puma_rent"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
scard_df["puma_income_adj"] = scard_df.apply(
    lambda row: (row["puma_income"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
scard_df["puma_rent_adj"] = scard_df.apply(
    lambda row: (row["puma_rent"] / cpi_df.at[row["year"]]) * 100,
    axis=1
)
# Annualized: adjusted monthly rent times 12, for comparison against yearly costs
scard_df["puma_rent_adj_ytd"] = scard_df.apply(
    lambda row: (row["puma_rent"] / cpi_df.at[row["year"]]) * 100 * 12,
    axis=1
)
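As an aside, the row-wise `apply` calls above can be vectorized by mapping the year column through the CPI series. A sketch assuming, as in the cells above, that `cpi_df` behaves like a Series of CPI values indexed by year (the numbers here are hypothetical stand-ins):

```python
import pandas as pd

# Stand-in CPI series indexed by year (hypothetical values)
cpi = pd.Series({2009: 214.5, 2020: 258.8})
df = pd.DataFrame({"year": [2009, 2020], "puma_rent": [800.0, 1100.0]})

# Equivalent to apply(lambda row: row["puma_rent"] / cpi.at[row["year"]] * 100, axis=1),
# but computed column-at-a-time
df["puma_rent_adj"] = df["puma_rent"] / df["year"].map(cpi) * 100
print(df)
```

On a frame with thousands of rows this avoids invoking a Python lambda per row.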
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Violin(
        x=puma_df["year"],
        y=puma_df["puma_income_adj"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=1
)
fig.add_trace(
    go.Violin(
        x=puma_df["year"],
        y=puma_df["puma_rent_adj"],
        box_visible=True,
        meanline_visible=True
    ),
    row=1, col=2
)
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Year", row=1, col=2)
fig.update_yaxes(title_text="Median Household Income", row=1, col=1)
fig.update_yaxes(title_text="Median Monthly Rent", row=1, col=2)
fig.update_layout(title_text="Household Income vs. Monthly Rent Over Time for PUMA Adjusted for Inflation")
fig.show()
There still seems to be some fluctuation over time, but it appears significantly less important now. What about the relationship between rent and nearby university costs? Let's plot the relevant data and break it down by the ownership of nearby institutions to see if the areas around public and private institutions behave differently.
fig = px.scatter(scard_df, x="net_cost_adjusted", y="msa_rent_adj_ytd", trendline="ols", facet_col="ownership")
fig.show()
The trend lines differ, so ownership seems to affect rent data as well. Let's see whether the distribution of rents across our PUMAs is normal for public and private institutions.
fig = px.violin(scard_df, x="ownership", y="puma_rent_adj_ytd")
fig.show()
It seems some other factor is flattening the distribution for PUMAs surrounding private institutions. Let's look at locale and the density of the surrounding area to see if there is a relationship there.
fig = px.violin(scard_df, x="locale", y="puma_rent_adj_ytd")
fig.show()
While rents across most localities are unimodal, rural and remote towns are bimodal, and larger localities are slightly vertically skewed.
Since monthly rent appears to increase roughly linearly with household income, we will use a least-squares linear regression to highlight this relationship.
fig = px.scatter(msa_df, x="msa_rent_adj", y="msa_income_adj", trendline="ols")
fig.show()
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.810
Model: OLS Adj. R-squared: 0.810
Method: Least Squares F-statistic: 2797.
Date: Fri, 19 May 2023 Prob (F-statistic): 1.32e-238
Time: 17:34:17 Log-Likelihood: -6045.3
No. Observations: 657 AIC: 1.209e+04
Df Residuals: 655 BIC: 1.210e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6634.8486 361.556 18.351 0.000 5924.901 7344.796
x1 38.8033 0.734 52.891 0.000 37.363 40.244
==============================================================================
Omnibus: 5.072 Durbin-Watson: 1.704
Prob(Omnibus): 0.079 Jarque-Bera (JB): 5.041
Skew: -0.215 Prob(JB): 0.0804
Kurtosis: 3.006 Cond. No. 1.90e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
fig = px.scatter(puma_df, x="puma_rent_adj", y="puma_income_adj", trendline="ols")
fig.show()
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.716
Model: OLS Adj. R-squared: 0.716
Method: Least Squares F-statistic: 8616.
Date: Fri, 19 May 2023 Prob (F-statistic): 0.00
Time: 17:34:17 Log-Likelihood: -33340.
No. Observations: 3419 AIC: 6.668e+04
Df Residuals: 3417 BIC: 6.670e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5487.7969 211.505 25.946 0.000 5073.108 5902.486
x1 41.7831 0.450 92.823 0.000 40.901 42.666
==============================================================================
Omnibus: 9.532 Durbin-Watson: 1.979
Prob(Omnibus): 0.009 Jarque-Bera (JB): 10.977
Skew: -0.060 Prob(JB): 0.00413
Kurtosis: 3.250 Cond. No. 1.40e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
As the OLS regression summaries show, the MSA regression fits more tightly, with an R-squared of 0.810 compared to PUMA's 0.716. This can be explained by PUMA having significantly more data points (PUMAs are more specific than MSAs) and wider variance among the data (for example, adjusted monthly rent for MSAs does not even exceed 1000, while monthly rent for PUMAs goes well above 1100). This, along with other factors, suggests a strong tendency for universities to affect pricing much more on a small scale, such as the city level, than on a county- or metropolitan-wide scale.
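The R-squared values compared above can be reproduced by hand: for an OLS fit with an intercept, R² is one minus the ratio of residual variance to total variance. A minimal numpy sketch on hypothetical rent/income pairs (not our real data):

```python
import numpy as np

# Hypothetical monthly-rent (x) and household-income (y) pairs
x = np.array([500.0, 650.0, 800.0, 950.0, 1100.0])
y = np.array([26000.0, 32000.0, 37000.0, 45000.0, 49000.0])

# Least-squares line y ≈ a + b*x (polyfit returns highest degree first)
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
r2 = 1 - resid.var() / y.var()   # 1 - SS_res / SS_tot
print(round(r2, 3))
```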
Let's try to combine the relationships we found between cost, ownership, and locale in our EDA into a model describing the relationship between our colleges and the Public Use Microdata Areas and MSAs surrounding them. To do this, we will build multiple models representing different possible relationships, and then fit those models to our data. The next cell generates and fits our models.
puma_rent_model = smf.ols(formula="puma_rent_adj ~ year * ownership + in_tuition_adjusted", data=scard_df)
puma_rent_res = puma_rent_model.fit()
msa_rent_model = smf.ols(formula="msa_rent_adj ~ year * ownership + in_tuition_adjusted", data=scard_df)
msa_rent_res = msa_rent_model.fit()
prm2 = smf.ols(formula="puma_rent_adj ~ year * ownership", data=scard_df)
prm2_res = prm2.fit()
prm3 = smf.ols(formula="puma_rent_adj ~ year * ownership + locale", data=scard_df)
prm3_res = prm3.fit()
num = len(scard_df["year"].values)
df_samples = scard_df[["year", "ownership", "in_tuition_adjusted"]]
df_samples = sm.add_constant(df_samples)
sample_set = np.column_stack(
    (
        np.ones(num),
        scard_df["year"].values,
        scard_df["ownership"].values,
        scard_df["in_tuition_adjusted"].values,
    )
)
df_samples2 = scard_df[["year", "ownership"]]
df_samples2 = sm.add_constant(df_samples2)
sample_set2 = np.column_stack(
    (
        np.ones(num),
        scard_df["year"].values,
        scard_df["ownership"].values,
    )
)
# Formula-based results accept a DataFrame and select the columns the formula needs
scard_df["puma_rent_fit"] = puma_rent_res.predict(df_samples)
scard_df["puma_rent_fit2"] = prm2_res.predict(df_samples2)
scard_df["msa_rent_fit"] = msa_rent_res.predict(df_samples)
Let's plot the first model we created, which assumes a relationship between year, cost, and ownership in determining rents.
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=800, title_text="Interaction Model Public Use Microdata Area Rent Distribution and Trend Lines")
for i, own in enumerate(scard_df["ownership"].unique()):
    sub_df = scard_df[scard_df["ownership"] == own].copy().sort_values(by="year")
    trend_df = sub_df.groupby("year")["puma_rent_fit"].mean()
    fig.add_trace(
        go.Violin(
            x=sub_df['year'],
            y=sub_df['puma_rent_adj'],
            name=own,
        ),
        row=(i + 1), col=1
    )
    fig.add_trace(
        go.Scatter(
            x=trend_df.index,
            y=trend_df,
            name=own,
        ),
        row=(i + 1), col=1
    )
fig.show()
The variance between years is negligible, but rents in public-university communities appear to be trending downward, while private ones trend upward. Let's look at our MSA data as well to see whether this is reflected in broader geographic regions.
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=800, title_text="Interaction Model Metropolitan Statistical Area Rent Distribution and Trend Lines")
for i, own in enumerate(scard_df["ownership"].unique()):
    sub_df = scard_df[scard_df["ownership"] == own].copy().sort_values(by="year")
    trend_df = sub_df.groupby("year")["msa_rent_fit"].mean()
    fig.add_trace(
        go.Violin(
            x=sub_df['year'],
            y=sub_df['msa_rent_adj'],
            name=own,
        ),
        row=(i + 1), col=1
    )
    fig.add_trace(
        go.Scatter(
            x=trend_df.index,
            y=trend_df,
            name=own,
        ),
        row=(i + 1), col=1
    )
fig.show()
Interestingly, the prior relationship is inverted, and the private data appears somewhat bimodal, implying we are missing a variable. From here on, let's analyze our OLS summaries and p-values for significance.
print(puma_rent_res.summary())
print(puma_rent_res.pvalues)
OLS Regression Results
==============================================================================
Dep. Variable: puma_rent_adj R-squared: 0.003
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 3.598
Date: Fri, 19 May 2023 Prob (F-statistic): 0.00619
Time: 17:34:17 Log-Likelihood: -31199.
No. Observations: 4777 AIC: 6.241e+04
Df Residuals: 4772 BIC: 6.244e+04
Df Model: 4
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 678.5107 2084.476 0.326 0.745 -3408.024 4765.046
ownership[T.Public] 4648.0332 3175.094 1.464 0.143 -1576.615 1.09e+04
year -0.1306 1.038 -0.126 0.900 -2.165 1.904
year:ownership[T.Public] -2.2918 1.577 -1.453 0.146 -5.384 0.801
in_tuition_adjusted 0.0013 0.001 1.402 0.161 -0.001 0.003
==============================================================================
Omnibus: 1126.099 Durbin-Watson: 1.669
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2423.724
Skew: 1.361 Prob(JB): 0.00
Kurtosis: 5.183 Cond. No. 2.15e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.15e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept 0.744812
ownership[T.Public] 0.143286
year 0.899891
year:ownership[T.Public] 0.146330
in_tuition_adjusted 0.160856
dtype: float64
Our first model, analyzing cost, year, and ownership, is significant overall (Prob(F-statistic) ≈ 0.006), implying the model as a whole is meaningful; however, none of its individual variables reaches the p < 0.05 significance threshold. Let's try a different model, this time without tuition costs.
print(prm2_res.summary())
print(prm2_res.pvalues)
OLS Regression Results
==============================================================================
Dep. Variable: puma_rent_adj R-squared: 0.003
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 4.141
Date: Fri, 19 May 2023 Prob (F-statistic): 0.00610
Time: 17:34:17 Log-Likelihood: -31200.
No. Observations: 4777 AIC: 6.241e+04
Df Residuals: 4773 BIC: 6.243e+04
Df Model: 3
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept -276.5553 1970.291 -0.140 0.888 -4139.234 3586.123
ownership[T.Public] 5307.0927 3140.441 1.690 0.091 -849.619 1.15e+04
year 0.3556 0.978 0.364 0.716 -1.562 2.273
year:ownership[T.Public] -2.6279 1.559 -1.685 0.092 -5.685 0.429
==============================================================================
Omnibus: 1120.860 Durbin-Watson: 1.669
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2407.705
Skew: 1.356 Prob(JB): 0.00
Kurtosis: 5.178 Cond. No. 3.14e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.14e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept 0.888379
ownership[T.Public] 0.091109
year 0.716213
year:ownership[T.Public] 0.091998
dtype: float64
This model with only year and ownership yields a slightly larger F-statistic and a marginally smaller Prob(F-statistic), so it is slightly more predictive, and its variables' p-values are closer to 0.05; however, they still do not reach clear statistical significance. This seems to be the right track, though. Let's see what happens when we incorporate locale into our model.
print(prm3_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: puma_rent_adj R-squared: 0.076
Model: OLS Adj. R-squared: 0.074
Method: Least Squares F-statistic: 43.48
Date: Fri, 19 May 2023 Prob (F-statistic): 1.50e-75
Time: 17:34:17 Log-Likelihood: -31017.
No. Observations: 4777 AIC: 6.205e+04
Df Residuals: 4767 BIC: 6.212e+04
Df Model: 9
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
Intercept 1367.7492 1900.237 0.720 0.472 -2357.592 5093.090
ownership[T.Public] 6119.0853 3030.270 2.019 0.044 178.357 1.21e+04
locale[T.City: Midsize] -15.0368 7.507 -2.003 0.045 -29.755 -0.319
locale[T.City: Small] -87.6213 6.972 -12.568 0.000 -101.289 -73.954
locale[T.Rural: Fringe] 14.5834 37.292 0.391 0.696 -58.526 87.693
locale[T.Suburb: Large] -4.5794 8.502 -0.539 0.590 -21.247 12.088
locale[T.Suburb: Midsize] -79.0237 7.582 -10.423 0.000 -93.887 -64.160
locale[T.Town: Remote] -155.8319 11.912 -13.082 0.000 -179.186 -132.478
year -0.4390 0.943 -0.465 0.642 -2.289 1.410
year:ownership[T.Public] -3.0348 1.505 -2.017 0.044 -5.985 -0.085
==============================================================================
Omnibus: 1146.717 Durbin-Watson: 1.790
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2518.813
Skew: 1.375 Prob(JB): 0.00
Kurtosis: 5.258 Cond. No. 3.14e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.14e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
print(prm3_res.pvalues)
Intercept                    4.716967e-01
ownership[T.Public]          4.350981e-02
locale[T.City: Midsize]      4.523786e-02
locale[T.City: Small]        1.154238e-35
locale[T.Rural: Fringe]      6.957702e-01
locale[T.Suburb: Large]      5.901630e-01
locale[T.Suburb: Midsize]    3.624453e-25
locale[T.Town: Remote]       1.913221e-38
year                         6.416813e-01
year:ownership[T.Public]     4.375242e-02
dtype: float64
As the OLS regression summary shows, this model incorporating locale, ownership, and year fits far better, with a near-zero Prob(F-statistic) and almost all of its variables crossing the threshold into statistical significance. This suggests that rents in the region surrounding a college vary strongly with the locale and the ownership of that college. The locale result makes sense: a university in a major city might increase rent but will have a less drastic overall effect, and a major city will also likely host more universities, leading to a smaller coefficient. On the other hand, a university setting up shop in a small town would much more significantly increase housing demand and value in the area, leading to a larger coefficient.
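For reference, the `ownership[T.Public]` and `locale[T.…]` terms in these summaries are treatment (dummy) codes: the formula API drops one baseline level of each categorical and estimates an offset for every remaining level. A sketch of the same encoding in pandas (the column values here are hypothetical):

```python
import pandas as pd

own = pd.Series(["Private", "Public", "Public", "Private"], name="ownership")
# drop_first drops the baseline level ("Private" here, first alphabetically),
# mirroring the T.-prefixed treatment coding in the summaries
dummies = pd.get_dummies(own, prefix="ownership", drop_first=True)
print(dummies)
```

So the `ownership[T.Public]` coefficient reads as the estimated rent offset of a public institution relative to the baseline level.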
Based on our observations throughout our project, we can conclude:
If you would like to learn more about the tools we used to make this: